Method and device for detecting starting and ending points of sound segment in video

ABSTRACT

Provided are an envelope arithmetic device for arithmetically determining an envelope of a sound signal waveform associated with video image signals inputted on a time-serial basis, a sound level threshold setting device for setting in advance a threshold value of sound level for values of the envelope, and a start/end point detecting device for detecting points at which the threshold level and the envelope intersect each other as the start and end points of a sound segment. The envelope of the sound waveform associated with the video is thereby determined arithmetically, a point at which the value of the envelope exceeds the threshold of the sound level is detected as the start point of the sound segment, and a point at which the value of the envelope becomes smaller than the threshold value is detected as the end point. The interval of the video corresponding to the start point and the end point is registered in terms of a number identifying a frame constituting a part of the motion pictures.

TECHNICAL FIELD

The present invention relates to a method and an apparatus for detecting sound segments of audio data associated with moving pictures such as a video program recorded on a video tape or a disk, and is concerned with a method and an apparatus which can simplify indexing of a leading position of an audio sequence or interval in a video program.

BACKGROUND TECHNIQUES

With the advent of high-speed computers and the availability of memory devices or storages of large capacity in recent years as the background, it has now become possible to handle a mass of moving pictures and associated audio information through digitization thereof. In particular, in the field of the editing of moving pictures and management thereof, the digitized moving pictures can be handled or processed by the pick-up device and the editing apparatus as well as the managing apparatus for production of video programs. As one of these apparatuses, there can be mentioned a CM managing apparatus (also known under the name of CM bank) which is designed for managing several thousand varieties of commercial video segments (video clips) (hereinafter also referred to as the CM or CM video) for preparing given CM videos (video clips) in the order for broadcasting. Heretofore, a plurality of CM video materials have been recorded on a single video tape before broadcasting. In recent years, a CM managing apparatus has also come into use which is designed for broadcasting the CM video materials supplied from producers thereof such as advertising agencies. The CM video materials have been supplied individually on a program-by-program basis in the form of video tapes, respectively, wherein the video supplied as the mother material contains the name or identifier of the producer and data concerning the production in addition to the intrinsic CM video entity. Further, so-called idle pictures are inserted, respectively, in precedence and in succession to the CM video for several seconds for the purpose of realizing alignment in timing upon the broadcasting. Such being the circumstances, there arises the necessity of registering a start and an end of the CM video (clip) to be broadcast in addition to the storage of the mother material supplied from the producer on another recording medium such as a tape, disk or the like by copying.

The work for checking the start and the end of the CM video is currently carried out entirely manually, which has imposed a heavy burden on the operator in charge. Because the idle pictures are taken, respectively, in continuation to the start and the end of the intrinsic CM video entity, the operator often encounters such a situation that the extent of the CM video to be really broadcast cannot be discerned merely through visual observation or check. In the case of the CM video or the like which is constituted by a combination of audio and video, the operator determines discriminatively the start and the end of the video by checking auditorily the sound in the idle intervals in the video (clip) because no sound is recorded in the idle intervals. In the present state of the art, there is unavailable any other method than the one in which the operator decides auditorily the presence or absence of sound by repeating manipulation such as reproduction or play of the video, stoppage or pause, reverse reproduction or reverse play, etc. These manipulations are certainly improved by adopting a dial such as a jog, a shuttle or the like in the video reproducing apparatus or by making use of a scroll bar on an image screen of a computer. However, such manipulations still consume a considerable amount of manpower.

With the present invention, it is contemplated as an object thereof to provide a method and an apparatus which make it possible to automate the work involved in deciding auditorily the presence or absence of sound at the start and the end of a CM video (clip) upon registration of CM video material while automating operation for the registration for simplification thereof.

Another object of the present invention is to provide a method and an apparatus for detecting the start and end points of an intrinsic CM video entity on a real-time basis for registering the positions of the start and end points, respectively.

DISCLOSURE OF THE INVENTION

In an interactive registration processing for registering a video in a video managing apparatus, it is taught according to the present invention to provide an envelope arithmetic means for determining arithmetically an envelope of waveform of a sound signal inputted on a time-serial basis, a sound level threshold value setting means for setting previously a threshold value of sound level for comparison with values of the envelope, and a start/end point detecting means for detecting a time point at which the envelope intersects the level of the aforementioned threshold value as a start point or an end point of a sound segment, to thereby allow the presence or absence of the sound determined heretofore with the auditory sense to be decided quantitatively and automatically. In that case, the start/end point detecting means mentioned above is provided with a silence time duration lower limit setting means for setting previously a lower limit on the duration of a silence state, a silence time duration arithmetic means for determining arithmetically an elapsed time during which the value of the envelope of the sound signal waveform has remained smaller than the threshold value of the sound level, and a silence time duration decision means for deciding that the above-mentioned silence time duration has exceeded the lower limit so that sound interruption of extremely short duration such as punctuation between phrases in a speech can be excluded from the detection. Similarly, the start/end point detecting means mentioned above is provided with a sound time duration lower limit setting means for setting previously a lower limit on the duration of a sound state, a sound time duration arithmetic means for determining arithmetically an elapsed time during which the value of the envelope of the sound signal waveform has exceeded the threshold value of the sound level, and a sound time duration decision means for deciding that the sound time duration has exceeded the lower limit so that noise or sound of one-shot nature can be prohibited from being detected. Furthermore, the envelope arithmetic means mentioned above is provided with a filtering means for performing a filtering processing having a predetermined constant time duration on the sound signal inputted on a time-serial basis. As the filtering means mentioned above, a maximum value filter for determining sequentially maximum values of a predetermined constant time duration for the sound signal inputted on a time-serial basis and a minimum value filter for determining sequentially minimum values of a predetermined constant time duration for the sound signal inputted on a time-serial basis are employed.

Furthermore, it is taught according to the present invention that a video reproducing means for reproducing a video material, a sound input means for inputting a sound signal recorded on an audio track of the video for reproduction as a digital signal on a time-serial basis, a sound processing means for detecting the start and end points of a sound segment from the sound signal as inputted, and a display means for displaying results of the detections are provided, for thereby enabling the position of the start and end points of the sound segment in the video material to be presented to an operator. The sound processing means is provided with a frame position determining means for determining the frame positions of the video at the time points at which the start and end points of the sound interval are detected in addition to the envelope arithmetic means, the sound level threshold value setting means and the start/end point detecting means mentioned previously. The frame position determining means mentioned above is provided with a timer means for counting the elapsed time, starting from the beginning of the detection processing, a means for reading out the frame positions of the video (or moving pictures), an elapsed time storage means for storing elapsed time at the time points at which the start and end points mentioned above are detected and elapsed time at a time point at which the frame position mentioned above is read out, and a frame position correcting means for correcting the frame position as read out by using the difference between both the elapsed times mentioned above so that a time lag involved in the detection of the start and end points up to the reading of the frame position can be corrected to thereby allow the frame position to be determined at the detection time point. Furthermore, the sound processing means mentioned above is provided with a means for stopping temporarily the reproduction of the video at the start and end points as detected, to thereby enable the reproduction of the video to be paused at the frame positions corresponding to the start and end points. In that case, a video reproducing apparatus capable of controlling the reproduction of the video by a computer is employed as the video reproducing means. By way of example, a video deck equipped with a VISCA (Video System Control Architecture) terminal, a video deck used generally in the editing by the professional or the like may be employed. In this way, head indexing to the sound segment as detected can be realized efficiently.

Furthermore, it is taught according to the present invention that the sound processing means mentioned previously is provided with a frame position storage means for storing individually the frame positions of the start point and the end point of the sound segment, and a display means for displaying individually the frame positions of the start point and the end point so that the positions of the start point and the end point of the sound segment in the video material can be presented individually to the operator. Besides, the sound processing means is provided with a buffer memory means for storing sound signals inputted time-serially on a constant time-duration basis and a reproducing means for reproducing the sound signals as inputted so that the operator can confirm visually and auditorily the sound interval as detected. Furthermore, on the assumption that the picture subjected to the processing is a CM video material and that such a general rule that the CM video entity has a time duration of 15 seconds or 30 seconds per CM program is made use of, the sound processing means mentioned above is provided with a time duration setting means for setting previously an upper limit of the length of time duration of the sound segment having a predetermined constant time duration together with a tolerance range of one or two seconds and a time duration comparison means for comparing the length of a detected time duration extending from the start point to the end point of the sound segment as detected with the set time duration length mentioned above for thereby allowing only the sound segment of a predetermined constant time duration to be detected in a CM video (clip). Additionally, the sound processing means is provided with a margin setting means for setting margins at front and rear sides, respectively, of the sound segment as detected so that the CM video (clip) for broadcasting which has the predetermined time duration can be registered in the CM managing apparatus from the CM video material.
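The time duration comparison described above can be sketched as follows. This is only an illustrative sketch under stated assumptions: the function name `matches_cm_duration`, the frame rate of 30 frames per second, and the reading of the tolerance as a symmetric window around the nominal 15-second or 30-second duration are hypothetical choices and are not taken from the specification.

```python
def matches_cm_duration(start_frame, end_frame, fps=30.0,
                        nominal_seconds=(15.0, 30.0), tolerance=1.0):
    """Return True when the detected sound segment fits a nominal CM length.

    Assumptions (not from the specification): frame positions are plain
    frame counts, fps is 30, and the tolerance is applied symmetrically.
    """
    duration = (end_frame - start_frame) / fps  # segment length in seconds
    return any(abs(duration - target) <= tolerance
               for target in nominal_seconds)


# A segment of 435 frames (14.5 s at 30 fps) is accepted as a 15-second CM,
# while a segment of 600 frames (20 s) is rejected.
print(matches_cm_duration(0, 435))
print(matches_cm_duration(0, 600))
```

A segment that fails the comparison would simply not be reported as a CM candidate; how margins are then distributed before and after an accepted segment is left open here.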

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a diagram showing a system configuration for realizing embodiments of the present invention,

FIG. 2 is a conceptual view for illustrating a method of detecting a sound segment according to the present invention,

FIG. 3 is a flow chart for illustrating the method of detecting the sound segment according to the present invention,

FIG. 4 is a view for illustrating the conditions for deciding the start and end points of a sound segment according to the present invention,

FIG. 5 is a view for illustrating an example of a screen image for manipulation,

FIG. 6 is a flow chart for illustrating flow of processings on the whole,

FIG. 7 is a view showing a control scheme of detection of the sound segment according to the present invention,

FIG. 8 is a view for illustrating positional relationship between input and output data in a filtering processing,

FIG. 9 is a flow chart for illustrating a flow of sound segment detection processing in which rules concerning time duration of a CM picture are adopted, and

FIG. 10 is a view showing examples of data structures for realizing the sound segment detection according to the present invention.

BEST MODES FOR CARRYING OUT THE INVENTION

In the following, exemplary embodiments of the present invention will be described by reference to the drawings.

FIG. 1 is a diagram showing an example of a system configuration for implementing the present invention. Reference numeral 101 denotes a display device such as a CRT or the like for displaying output of a sound processing unit 104. Inputting or setting of commands, threshold values and others for the sound processing unit 104 is carried out by using an input unit 105 which includes a pointing device such as a mouse or the like and a numeric value input device such as a ten-key array or the like. A video reproducing apparatus 110 is an apparatus which is designed for reproducing pictures recorded on a video tape, an optical disk or the like. A sound signal associated with a video reproduced and outputted by the video reproducing apparatus 110 sequentially undergoes conversion to a digital signal by a sound input unit 103, the digital signal being then inputted to the sound processing unit 104. Further, information such as a sampling frequency and a sampling bit number used in the conversion to the digital signal, and the channel number indicating monophonic or stereophonic (monophonic being represented by “1” with the stereophonic by “2”) and others is transferred to the sound processing unit 104 from the sound input unit 103. Of course, the above information may be supplied to the sound input unit 103 from the sound processing unit 104 as the numeric values set in the sound processing unit 104. The sound processing unit 104 processes the signals as received to thereby control the video reproducing apparatus 110. Transmission and reception of control commands and responses between the sound processing unit 104 and the video reproducing apparatus 110 are carried out via a communication line 102. In the case where individual frames of the video handled by the video reproducing apparatus 110 are allocated with frame numbers (time codes) in a sequential order, starting from the leading frame of the video, the image of a given frame number can be retrieved by sending the relevant frame number and a search command to the video reproducing apparatus 110 from the sound processing unit 104. Similarly, the sound processing unit 104 can also receive the current frame number of the video from the video reproducing apparatus 110 by issuing the relevant request to the latter. Internally of the sound processing unit 104, the digital signal of sound is once loaded to a memory 109 via an interface 108 and processed by a CPU 107 in accordance with a processing program stored in the memory 109. The processing program is stored in an auxiliary storage unit 106 and transferred to the memory 109 optionally in response to the command issued by the CPU 107. A variety of data generated through processings described hereinafter is stored accumulatively in the memory 109 and can be referenced as occasion requires. The sound digital signal and various information such as information resulting from processings and the like can also be stored in the auxiliary storage unit 106. A loudspeaker 111 reproduces the sound signal inputted to the sound processing unit 104 from the sound input unit 103 synchronously with the inputting as well as the sound signal stored in the memory 109 in response to the user's demand.

In the following, description will be directed firstly to a method of detecting sound segments associated with a video, which method allows the user to detect easily the sound segments in the video while confirming or observing the video. In succession, description will be made of a sound segment detecting apparatus which is realized by adopting the method mentioned above, which will be followed by the description concerning a method of finding a broadcasting-destined CM video of a predetermined constant time duration from a CM video material.

FIG. 2 is a schematic diagram for illustrating the method of detecting the sound segment contained in the picture according to the present invention.

Motion pictures 201 and a sound waveform 202 represent illustratively signals of image and sound, respectively, contained in a video. Although the sound waveform 202 is shown as being monophonic for simplification of the description, it may be stereophonic. In the case where the video of concern is a CM video material, idle pictures each of several-second duration are inserted in precedence and succession to an intrinsic CM video entity. Ordinarily, the idle pictures are photographed continuously in precedence and in succession to the intrinsic CM video entity and are the same as the leading and trailing images (frames), respectively, of the latter. Consequently, in many cases, difficulty or impossibility is encountered in discerning the CM video to be broadcast on the basis of observation of only the motion pictures 201. In the idle picture intervals, however, no sound is recorded. Such being the circumstances, the head and the end of the intrinsic CM video entity have heretofore been determined by the operator by deciding the presence or absence of the sound in the picture while repeating operations such as forward play, stop, reverse play and the like. According to the present invention, it is taught to automate the decision based on the auditory sense such as mentioned above by detecting the sound segment.

In the sound waveform 202, amplitudes of plus and minus values make appearance alternately and frequently and may instantaneously assume a magnitude of zero very frequently. Accordingly, solely with the check of magnitude of the amplitude at a given moment, the presence or absence of the sound around that time point cannot always be discerned. According to the instant embodiment, magnitude of the sound is determined on the basis of values of an envelope of the sound waveform 202. A value of the envelope reflects the presence or absence of the sound around that value. A point at which the value of the envelope exceeds a threshold value of a predetermined sound level is detected as the start point (IN) of the sound segment 203 while a point at which the envelope value becomes smaller than the threshold value is detected as an end point (OUT). By storing the sound data string from the start point to the end point in the memory 109 or the auxiliary storage unit 106 and reproducing the data, confirmation or discernment of the contents of the sound in the sound segment 203 can also easily be realized. The positions in the video corresponding to these detection points can be determined in terms of frame numbers. At the time point when a transition point such as the start point or end point of the sound segment 203 is detected, the video which succeeds the transition point has already been reproduced by the video reproducing apparatus 110. Accordingly, the frame number corresponding to the detection time point is read out or fetched from the video reproducing apparatus 110, whereon the frame number corresponding to the transition point is determined arithmetically by using the difference between the time point at which the frame number was read out from the video reproducing apparatus 110 and the time point at which the transition point occurred. A method of deriving or determining the frame number will be elucidated later on by referring to FIG. 7. By detecting the sound segment by making use of the envelope and establishing correspondence between the original video and the sound interval by making use of the frame number, the picture interval during which the sound continues to exceed a given sound level can be extracted. Further, by sending the frame number of the start point together with a search command to the video reproducing apparatus 110, head indexing of the frame in which the sound rises up can easily be realized. Furthermore, since the time duration extending from the start point to the end point can be known, setting of margins required for making up the CM video for the broadcasting before and after the video segment as extracted can easily be realized. In this manner, the CM video (clips) of high quality suffering no dispersion in the time duration can be registered in the CM managing apparatus.
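The frame number correction described above amounts to subtracting the frames that elapsed between the transition and the read-out. The following is a minimal sketch, assuming that elapsed times are measured in seconds from the start of the detection processing and that the video plays forward at a constant 30 frames per second; the function name `frame_at_transition` and its parameters are hypothetical and do not come from the specification.

```python
def frame_at_transition(frame_read, t_read, t_transition, fps=30.0):
    """Estimate the frame number at which the sound transition occurred.

    frame_read   : frame number fetched from the video reproducing apparatus
    t_read       : elapsed time [s] when that frame number was read out
    t_transition : elapsed time [s] at which the transition point occurred
    fps          : assumed constant frame rate (hypothetical default)
    """
    lag_seconds = t_read - t_transition      # time lag to be corrected
    lag_frames = round(lag_seconds * fps)    # same lag expressed in frames
    return frame_read - lag_frames


# Frame 1000 was read out 0.5 s after the transition, i.e. 15 frames late.
print(frame_at_transition(1000, t_read=12.5, t_transition=12.0))
```

With these example values the corrected frame number is 985.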

By virtue of the teachings of the present invention, the user who uses the system shown in FIG. 1 is required only to load a video tape or the like having video materials recorded thereon in the video reproducing apparatus 110 and manipulate buttons on a console of the sound processing unit 104 displayed on the display device 101. An example of screen image of the console will be explained later on by reference to FIG. 5. The user can thus get rid of the work for finding out the head and the end of the sound segment associated with the video through manual operation of a jog, a shuttle or the like. Thus, the operation or manipulation can be simplified, to an advantageous effect.

Next referring to FIGS. 3 and 4, the sound segment detecting method will be described in detail.

FIG. 3 is a flow chart for illustrating a method of detecting the start and end points of a sound segment associated with a video according to the present invention.

Reference numerals 301 to 306 designate program steps, respectively, and 311 to 316 designate output data of the individual steps, respectively. These programs and data are all placed on the memory 109 to be executed or processed by the CPU 107. Although the sound waveform is shown as being monophonic (channel number is “1”) for simplification of the description, similar procedure may be taken equally even in the case of a stereophonic sound (channel number is “2”). In the case of the stereophonic sound, the processings for the monophonic sound described below may be executed for each of the sound waveforms of the left and right channels, whereon the results of the processings for both the channels may be logically ANDed (determination of logical product) to thereby make decision as to overlap therebetween or alternatively logically ORed (determination of logical sum) for the decision as a whole.
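The combination of the left and right channel results can be illustrated as below. This is a hedged sketch: it assumes that each channel has already been reduced to a list of binary values (1 for sound, 0 for silence) of equal length, and the function name `combine_channels` and its `mode` parameter are hypothetical.

```python
def combine_channels(left, right, mode="or"):
    """Combine per-channel sound/silence decisions into one binary sequence.

    mode="and": sound only where both channels carry sound (overlap).
    mode="or" : sound where either channel carries sound (decision as a whole).
    """
    if len(left) != len(right):
        raise ValueError("both channels must have the same number of samples")
    if mode == "and":
        return [l & r for l, r in zip(left, right)]
    if mode == "or":
        return [l | r for l, r in zip(left, right)]
    raise ValueError("mode must be 'and' or 'or'")


left = [0, 1, 1, 0]
right = [0, 0, 1, 1]
print(combine_channels(left, right, "and"))  # [0, 0, 1, 0]
print(combine_channels(left, right, "or"))   # [0, 1, 1, 1]
```

The logical sum is the more permissive choice, since a sound present on only one channel still marks the interval as sound.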

At first, in the step 301, audio data associated with the video is received from the sound input unit 103. Reference numeral 311 designates waveform of the sound data as received. In the step 302, absolute values of individual data carried by the sound waveform 311 are determined to thereby execute fold-up processing for the sound waveform, because only the sound level is of concern regardless of the contents or implication of the sound. Reference numeral 312 designates a sound waveform resulting from the processing for folding up the sound waveform 311 to the plus side. Subsequently, in the steps 303 and 304, an envelope of the waveform 312 is determined through maximum/minimum type filterings. To this end, filters of filter sizes 321 and 322 are prepared for the respective filterings, and the input data are sequentially fetched into the filters for thereby determining the maximum value and the minimum value in the filters to be outputted. In the step 303, the maximum value in the filter is outputted for the waveform 312 on a data-by-data basis. In the step 304, the minimum value in the filter is outputted for the maximum-value waveform 313 on a data-by-data basis. Reference numeral 314 designates envelopes obtained as the result of the filtering processings. In the step 305, a threshold processing is performed for comparing the individual data of the envelopes 314 with a threshold value 323 predetermined for the sound level. When the envelope 314 exceeds the threshold value 323, “1” indicating the presence of sound is outputted, while “0” indicative of the absence of sound is outputted when the envelope is short of the threshold value. Reference numeral 315 designates binary data of the sound and the silence outputted from the processing step 305. Finally, in the step 306, the sound waveform 312 is checked as to the continuity of sound and silence on the basis of the binary data 315 for detecting a sound segment 324, whereon start and end points 316 of the sound segment are outputted. More specifically, the rise point of the sound interval is outputted as a start point 325 (IN) of the sound while the fall point of the sound interval is outputted as an end point 326 (OUT) of the sound. Concerning this step 306, description will be made by referring to a timing chart shown in FIG. 4.
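The steps 302 to 305 can be sketched in a few lines. The sketch below is illustrative only and makes several assumptions that are not in the specification: samples are given as a plain list of integers, the filter window is causal (it covers the current sample and the preceding ones, so the output lags the input), and the names `sliding_extreme`, `envelope` and `binarize` are hypothetical. The window evaluation is deliberately naive for readability.

```python
def sliding_extreme(data, size, pick):
    """Apply pick (max or min) over a causal window of `size` samples."""
    return [pick(data[max(0, i - size + 1): i + 1]) for i in range(len(data))]


def envelope(samples, max_size, min_size):
    folded = [abs(s) for s in samples]               # step 302: fold to plus side
    maxed = sliding_extreme(folded, max_size, max)   # step 303: maximum filter
    return sliding_extreme(maxed, min_size, min)     # step 304: minimum filter


def binarize(env, threshold):
    # Step 305: 1 = sound (envelope exceeds threshold), 0 = silence.
    return [1 if value > threshold else 0 for value in env]


samples = [0, 5, -5, 0, 4, -4, 0, 0, 0, 0]
env = envelope(samples, max_size=3, min_size=3)
print(env)               # [0, 0, 0, 5, 5, 4, 4, 4, 0, 0]
print(binarize(env, 2))  # [0, 0, 0, 1, 1, 1, 1, 1, 0, 0]
```

Note how the zero crossing inside the burst (the fourth sample) no longer appears as silence in the envelope, and how the causal windows shift the detected interval later than the input; any practical use would have to account for this positional offset.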

The method of arithmetically determining the envelope through the maximum/minimum type filtering can be realized with remarkably reduced computation overhead when compared with a method of calculating the power spectrum of the sound waveform to thereby determine the power of degree zero as the envelope. Accordingly, the method described above can be carried out even with the CPU whose capability or performance is not so high.

As the one-dimensional maximum/minimum type filtering described above in conjunction with the steps 303 and 304, there may be adopted the filtering procedure described, for example, in “HIGH-SPEED ARITHMETIC PROCEDURE FOR MAXIMUM/MINIMUM TYPE IMAGE FILTERING” (The Institute of Electronics, Information and Communication Engineers of Japan, Theses Collection D-II, Vol. J78-D-II, No. 11, pp. 159-1607, November, 1995). This procedure is a sequential data processing scheme which can be realized by making use of a ring buffer capable of storing (n+1) data for a filter size n. With this procedure, the maximum value and the minimum value can be determined by performing arithmetic operation about three times for one data on an average, regardless of the nature of the data and the filter size. Accordingly, this procedure is suited for the application where a large amount of data has to be processed at high speed as in the instant case.
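The details of the cited procedure are not reproduced in this description. As a stand-in, the sketch below uses a different, widely known technique, the monotonic double-ended queue, which likewise processes the data sequentially with a small bounded buffer and a cost per sample that does not grow with the filter size. The function names `sliding_max` and `sliding_min` and the causal window convention are assumptions made for illustration.

```python
from collections import deque


def sliding_max(data, size):
    """Causal sliding maximum using a monotonic deque of sample indices."""
    window = deque()  # indices whose values are kept in decreasing order
    out = []
    for i, value in enumerate(data):
        # Drop older samples that can never be the maximum again.
        while window and data[window[-1]] <= value:
            window.pop()
        window.append(i)
        # Drop the front index once it has left the window of `size` samples.
        if window[0] <= i - size:
            window.popleft()
        out.append(data[window[0]])
    return out


def sliding_min(data, size):
    """Causal sliding minimum, obtained by negating the input and output."""
    return [-v for v in sliding_max([-x for x in data], size)]


folded = [0, 5, 5, 0, 4, 4, 0, 0, 0, 0]
maxed = sliding_max(folded, 3)
print(maxed)                  # [0, 5, 5, 5, 5, 4, 4, 4, 0, 0]
print(sliding_min(maxed, 3))  # [0, 0, 0, 5, 5, 4, 4, 4, 0, 0]
```

Each index is appended and removed at most once, so the total work is linear in the number of samples.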

FIG. 4 is a view for illustrating a method of deciding the start and end points of a sound segment.

For making decision as to the start/end point of a sound segment, the conditions for the start/end point decision are defined as follows:

start point: the point at which state transition occurs when the sound state has continued longer than Ts inclusive after the silence state had continued longer than Tn inclusive, and

end point: the point at which state transition occurs when the silence state has continued longer than Tn inclusive after the sound state had continued longer than Ts inclusive,

where Ts [msec] represents a lower limit for the length of elapsed time of the sound state, and Tn [msec] represents a lower limit for the length of elapsed time of the silence state. Values of Ts and Tn may previously be set with reference to the time duration of one syllable of speech and/or the time duration of a pause intervening between aural statements. In this way, the sound state of a duration shorter than Ts as well as the silence state shorter than Tn can be excluded from the detection. Thus, there can be realized a stable or reliable sound segment detecting method which is insusceptible to the influence of one-shot noise and of sound interruption of extremely short duration such as punctuation between phrases in a speech.
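Because the decision operates on sampled data, the limits Ts and Tn are in practice compared against sample counts. A minimal sketch of that conversion follows; the function name `ms_to_samples`, the choice of rounding up, and the example values of Ts, Tn and the sampling frequency are assumptions for illustration only.

```python
import math


def ms_to_samples(duration_ms, sampling_hz):
    """Convert a duration in milliseconds to a number of samples.

    Rounding up is an assumed convention, so that the counted run is
    never shorter than the requested duration.
    """
    return math.ceil(duration_ms * sampling_hz / 1000)


# Hypothetical settings: Ts = 200 ms, Tn = 80 ms, sampling at 11025 Hz.
print(ms_to_samples(200, 11025))  # 2205
print(ms_to_samples(80, 11025))   # 882
```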

Reference numeral 401 designates generally a timing chart for illustrating a process until the start and end points 316 of a sound interval are determined from the input data 315 in the step 306. As flags for discriminatively identifying the states, there are provided four flags, i.e., a silence flag, a sound flag, a start flag and an end flag.

In the step 306, the input data 315 indicating the binary states of sound and silence are checked sequentially, whereon the numbers of data “0” (silence) and “1” (sound) are counted, respectively, for determining the elapsed times of the silence and sound states, respectively. Since the sampling frequency for digitizing the sound signal has been transferred to the sound processing unit 104 from the sound input unit 103, the time conditions Ts and Tn can easily be replaced by the conditions given in terms of the number of data. Parenthetically, the data number representative of the sound state is cleared at a time point when the silence flag is set “ON”, while the data number representative of the silence state is cleared at a time point when the sound flag is set “ON”. At the beginning, all the flags are set “OFF” and the data numbers of both the states are set “0”. At first, the silence flag is set “ON” at a time point when the silence state has continued for Tn (402). When the silence flag is “ON”, the points at which transition to the sound state from the silence state occurs are all selected as the candidates for the start point and the relevant data positions are stored in the memory 109. At first, the rise of a sound state 403 is fetched as a candidate for the start point of the sound state. However, since the elapsed time of the sound state 403 is short of Ts, the data number for the sound state 403 is classified as the data number (elapsed time) for the silence state to be rejected as noise of one-shot nature. Subsequently, the rise of a sound state 404 is fetched as a candidate for the start point, and the sound flag is set “ON” when the sound state has continued for Ts (405). Thus, both the silence flag and the sound flag are now set “ON” to satisfy the conditions for identifying the start point. Accordingly, the start flag is set “ON”, and a start point 325 (IN) is determined. The start flag set “ON” is reset “OFF” at a time point when it is sensed. The start point detecting procedure described above is performed up to a point 420 on the time axis.

Upon ending of the detecting procedure for the start point, a detecting procedure for the end point is started in continuation. At first, the silence flag is set “OFF” (406). When the sound flag is “ON”, the points at which transition to the silence state from the sound state occurs are all selected as the candidates for the end point, and relevant data positions are stored in the memory 109. Since the elapsed time of the silence state 407 is shorter than Tn, the data of the silence state 407 is switched into a sound state and merged into the sound states in front and behind, and is thus ignored as a silence interval of short duration. Subsequently, the silence flag is set “ON” when the silence state 408 has continued for Tn (409). Thus, both the sound flag and the silence flag are now set “ON” to satisfy the conditions for identifying the end point. Accordingly, the end flag is set “ON”, and the end point 326 (OUT) is determined. The end flag which is set “ON” is reset “OFF” at a time point when it is sensed. Further, the sound flag is also set “OFF” for preparation for the succeeding start point detecting procedure (410). The end point detecting procedure described above is performed up to a point 421 on the time axis.
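The flag handling of the step 306 can be condensed into a small state machine. The sketch below is a simplified illustration, not the procedure of FIG. 4 itself: it keeps only the silence flag and the sound flag, expresses Ts and Tn directly as sample counts (`min_sound`, `min_silence`), reports each segment as a pair of data positions with the end position exclusive, and does not report a segment that is still open when the data ends. All names are hypothetical.

```python
def detect_segments(binary, min_sound, min_silence):
    """Return (start, end) index pairs of sound segments in a 0/1 sequence."""
    segments = []
    silence_flag = sound_flag = False
    silence_run = sound_run = 0
    candidate = start = None
    for i, bit in enumerate(binary):
        if not sound_flag:                      # start point detection
            if bit == 1 and silence_flag:
                if sound_run == 0:
                    candidate = i               # candidate for the start point
                sound_run += 1
                if sound_run >= min_sound:      # sound has lasted Ts
                    sound_flag, silence_flag = True, False
                    start, silence_run = candidate, 0
            elif bit == 1:
                silence_run = 0                 # silence broken before Tn
            else:
                silence_run += sound_run + 1    # short burst counted as silence
                sound_run = 0
                if silence_run >= min_silence:  # silence has lasted Tn
                    silence_flag = True
        else:                                   # end point detection
            if bit == 0:
                if silence_run == 0:
                    candidate = i               # candidate for the end point
                silence_run += 1
                if silence_run >= min_silence:  # silence has lasted Tn
                    segments.append((start, candidate))
                    sound_flag, silence_flag = False, True
                    sound_run = 0
            else:
                silence_run = 0                 # short gap merged into sound
    return segments


data = [0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0]
print(detect_segments(data, min_sound=3, min_silence=2))  # [(6, 13)]
```

In this example the one-sample burst at position 3 is rejected as one-shot noise and the one-sample gap at position 10 is merged into the surrounding sound, so a single segment from position 6 up to (but not including) position 13 is reported.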

By manipulating the flags as described above by reference to FIG. 4, the start and end points of the sound segment can be successively detected. Even when a plurality of sound segments are provided in association with one video, each of the individual sound segments can be detected individually. Thus, the sound interval detecting method according to the present invention can find application not only to the CM video materials and the video programs but also to other videos in general such as those for TV broadcasting, archive video and the like. Furthermore, in the case where the picture subjected to the processing is a CM video material, such a general rule concerning the time duration of the CM video that “CM clip is to be realized with a time duration of 15 seconds or 30 seconds per CM entity” can be adopted. Thus, even when a plurality of sound segments are detected, these sound segments can be combined together into one set in accordance with the above-mentioned rule for the CM video, whereby the proper start and end points of the intrinsic CM video entity can be determined. The start/end-point detecting method in which the rule concerning the CM video is adopted will be described later on by reference to FIG. 9.

Now, description will be directed to a sound segment detecting apparatus realized by making use of the sound interval detecting method described above.

FIG. 5 shows an example of a screen image for manipulation or operation of a sound segment detecting apparatus realizing the teachings of the present invention. A manipulation window 501 is displayed on the display device 101 as a console of the sound processing unit 104 to present the environment for manipulation to the user. Within the manipulation window 501, there are disposed a QUIT button 502, a DETECT button 503, a detection result display panel 504, a sound waveform monitor 505, a sound interval display panel 506, a PLAY button 509, a video reproducing apparatus manipulation panel 510 and a parameter setting panel 513. The user can input to the sound processing unit 104 his or her command or request by clicking a relevant command button disposed on the manipulation window 501 with a mouse of the input unit 105. The QUIT button 502 is a command button for inputting a command for closing the manipulation window 501 by terminating the manipulation processing.

The DETECT button 503 is a command button for executing the sound segment detection processing. When the DETECT button 503 is clicked by the user, the sound processing unit 104 clears the detection result display panel 504 and then starts detection of the sound segment in accordance with the program 300, wherein the interim result of the processing which is being executed is displayed on the sound waveform monitor 505. Displayed on the sound waveform monitor 505 are the envelope 314 determined arithmetically and the threshold value 323 for the sound level. Upon detection of the start and end points of a sound segment, the frame numbers as detected are displayed on the panel 504, each in terms of a time code of a structure “hh:mm:ss:ff” (hh: hour, mm: minute, ss: second and ff: frame), which is convenient for the user because position and length can be grasped intuitively.
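The time code notation mentioned above can be illustrated with a short conversion routine. This is a sketch under stated assumptions: the function name `frame_to_timecode` is hypothetical, the frame rate is taken as exactly 30 frames per second, and drop-frame counting is not considered.

```python
def frame_to_timecode(frame, fps=30):
    """Format a frame number as hh:mm:ss:ff (non-drop-frame counting assumed)."""
    ff = frame % fps
    total_seconds = frame // fps
    ss = total_seconds % 60
    mm = (total_seconds // 60) % 60
    hh = total_seconds // 3600
    return f"{hh:02d}:{mm:02d}:{ss:02d}:{ff:02d}"


print(frame_to_timecode(450))     # 00:00:15:00
print(frame_to_timecode(108031))  # 01:00:01:01
```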

Displayed on the sound interval display panel 506 are a waveform 507 and a sound interval 508 of sound data which have been inputted before the start and end points of the sound segment are detected. The sound segment 508 corresponds to a period from an IN frame to an OUT frame on the detection result display panel 504. Because the time duration of the CM video (clip) is in general 30 seconds at the longest per one CM entity, it is assumed in the instant case that a sound waveform having a time duration of 40 seconds is displayed. The PLAY button 509 is a button for reproducing the sound data of the sound segment 508. The user can visually observe the sound signal associated with the video with the aid of the sound data waveform 507. Besides, by clicking the PLAY button 509 to thereby reproduce the sound, the sound data can also be confirmed by ear. In this way, the user can ascertain the result of detection immediately after the detection of the sound segment. Thus, the confirmation work can be much simplified.

When the user desires to provide the sound segment with margins, this can be accomplished by widening the interval by dragging the ends or edges of the sound segment 508. Because the start and end points of the sound segment are already known as displayed on the detection result display panel 504, the duration or length of the interval can be arithmetically determined. The user can provide the relevant sound segment with leading and trailing margins so that the time duration of the whole interval inclusive of the margins becomes equal to the desired length. The system alters the frame numbers displayed on the detection result display panel 504 in accordance with the length of the margins as affixed, whereon the altered frame numbers are set as the start and end points of the CM video (clip) to be registered in the CM managing apparatus. In this way, the user can easily proceed with the registration work for the CM managing apparatus. Additionally, by cutting out the video sandwiched between the start and end points of the video for the purpose of registration, the user can prepare a CM video (clip) for broadcasting which has a desired length.

Disposed on the video reproducing apparatus manipulation panel 510 is a set of video reproducing apparatus manipulation buttons 511. The manipulation button set 511 includes command buttons for executing the fast forwarding, rewinding, play, frame-by-frame stepping, pause, and so on. When the user clicks a desired one of the command buttons in the set of video reproducing apparatus manipulation buttons 511, the sound processing unit 104 sends the relevant manipulation command to the video reproducing apparatus 110. The frame position of the video is displayed within the frame position display box 512 in the form of a time code.

Disposed on the parameter setting panel 513 is a parameter setting box 514 for setting parameters for the sound interval detection. Arrayed in the parameter setting panel 513 as the changeable parameters are four parameters, i.e., the threshold value (Threshold Value) of the sound level, the time duration length (Filter length) of the filter, the lower limit of the length of the elapsed time of the sound state (Noise Limit) and the lower limit of the length of the elapsed time of the silence state (Silence). When the user desires to change the parameters, he or she may click the parameter setting box 514 and input relevant numeric values through the input unit 105. As for the threshold value (Threshold Value in the figure) of the sound level, it can also be set through another procedure described below, in addition to the inputting of the relevant value through the input unit 105. At first, when the parameter setting box for the threshold value of the sound level is clicked, the video reproducing apparatus 110 is stopped or set to the pause state. In this state, sound data is inputted to the sound processing unit 104 from the sound input unit 103 for several seconds. Subsequently, the maximum value of the sound level of the sound data inputted for several seconds is selected as the threshold value of the sound level. By inputting the sound data for several seconds, random noise of the sound signal generated in the video reproducing apparatus 110 and the sound input unit 103 can be inputted to the sound processing unit 104. Furthermore, by setting the maximum value of the noise mentioned above as the threshold value of the sound level, the inputted sound signals associated with the video can be protected from the influence of noise generated in the video reproducing apparatus 110 and the sound input unit 103.
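The noise-based threshold setting can be sketched in a few lines. The function name `noise_threshold` and the parameter `zero_level` are hypothetical, and the sketch assumes unsigned 8-bit samples centred on 128, a representation which the description does not specify.

```python
def noise_threshold(samples, zero_level=128):
    """Largest deviation from the zero level found in samples captured while paused."""
    return max(abs(s - zero_level) for s in samples)


# A few samples captured while the reproducing apparatus is paused (noise only).
paused_capture = [128, 130, 125, 129, 127]
print(noise_threshold(paused_capture))  # 3
```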

FIG. 6 is a flow chart for illustrating the flow of processings on the whole. In response to a program activation request inputted by a user, the CPU 107 reads out a program 600 from the auxiliary storage unit 106, which program is then placed on the memory 109 for execution. At that time, various sound data and processed data are also stored in the memory 109. Concerning the structure of these data, description will be made later on by reference to FIG. 10.

In a step 601, an initialization processing is executed upon starting of the processing. At the beginning, the CPU 107 allocates a memory area required for the processing on the memory 109 and clears it, whereon the CPU sets default values of the parameters such as the threshold value of the sound level and others. Subsequently, the manipulation window 501 of the sound processing unit 104 is displayed on the display device 101. Further, the setting for communication with the video reproducing apparatus 110 is initialized to open a communication port. In succession, the CPU sends a control command to the video reproducing apparatus 110 to set the reproducing operation of the video reproducing apparatus 110 to the pause state (STAND BY ON). By setting the video reproducing apparatus 110 to the pause state instead of the stopped state, the video reproducing apparatus 110 can be put into operation instantaneously in response to another control command, which means that the sound signal and the frame number can be read out rapidly.

In a step 602, presence or absence of an end request issued by the user is decided. So long as the end request is not issued, the screen image control of the step 603 is executed repetitively.

In a step 603, the processing procedure is branched in correspondence to a command button designated by the user. By way of example, when the user clicks the DETECT button 503 of the manipulation window 501, steps 608 and 609 are executed, whereupon inputting by the user is waited for. By increasing or decreasing the number and the variety of the command buttons disposed within the manipulation window 501, the number of branches as well as that of decisions as to the branching may be increased or decreased correspondingly, whereby the most suitable processing can always be selected.

In steps 604 to 609, processings which correspond to the individual command buttons, respectively, are executed.

In the step 604, in response to designation of a button in the set of video reproducing apparatus manipulation buttons 511, the processing corresponding to the designation is executed. This control processing can also be made use of as the processing for controlling the video reproducing apparatus 110 in cases other than the clicking of one of the video reproducing apparatus manipulation buttons 511. At first, a control command is sent to the video reproducing apparatus 110 to receive a response status from the video reproducing apparatus 110. Subsequently, decision is made as to the response status. When an error occurs, an error message is displayed on the display device 101 with the processing being suspended. When the control can be performed normally, the frame number is read out to be displayed in the display box 512, whereon return is made to the step 603.

In a step 605, parameter setting processing is executed in response to designation of the parameter setting box 514. When the parameter as set is altered in response to the input of a numeric value by the user through the input unit 105, the relevant parameter stored in the memory 109 is rewritten. Further, when the parameter concerning the time duration is altered, the time duration is converted into the data number in accordance with the sampling frequency of the (digitized) sound data.

In a step 606, a sound reproducing processing is executed for reproducing inputted sound data of the detected sound interval 508. When the start and end points of the sound interval are set in the detection result display panel 504, the sound data from the IN frame to the OUT frame displayed on the detection result display panel 504 is reproduced. In other words, the sound data stored in a sound data storing ring buffer 1050 is reproduced over a span from a start point data position 1052 to an end point data position 1053. In this way, the user can check the result of the detection by ear.

In a step 607, a margin setting processing is executed for providing the detected sound segment with margins. The user drags the ends of the sound interval 508 to thereby widen the interval, whereby the margins can be set. At first, the time duration of the sound segment extending from the IN frame to the OUT frame displayed on the detection result display panel 504 is arithmetically determined. By setting previously the length of the time duration of every CM video (clip) to be constant, the upper limit of the margin can be determined definitely on the basis of the length of the time duration of the relevant sound segment. The margin is determined while supervising the manipulation of the user so that the upper limit is not exceeded, and the frame numbers corresponding to the start and end points are corrected. Through this procedure, CM videos of high quality which show no variation in time duration can be registered in the managing apparatus. As an alternative procedure, appropriate margins which meet the upper limit condition may be automatically affixed to the leading and trailing ends, respectively, of the interval. Unless limitation is imposed on the time duration length, the margin can be affixed in conformance with the user's request.
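The automatic affixing of margins can be sketched as follows. The function name `add_margins` is hypothetical, the OUT frame is assumed to be exclusive (so that the length is OUT minus IN), and the even split between the leading and trailing margins is only one possible choice.

```python
def add_margins(in_frame, out_frame, target_frames):
    """Widen [in_frame, out_frame) to exactly target_frames, splitting the margin evenly."""
    length = out_frame - in_frame
    margin = target_frames - length  # upper limit of the total margin
    if margin < 0:
        raise ValueError("sound segment is longer than the target duration")
    lead = margin // 2
    trail = margin - lead
    return in_frame - lead, out_frame + trail


# A 440-frame sound segment registered as a 15-second clip (450 frames at 30 frames/s).
print(add_margins(100, 540, 450))  # (95, 545)
```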

In a step 608, a processing for detecting the start and end points of the sound segment is executed. When the DETECT button 503 is designated, the video is reproduced by the video reproducing apparatus 110 with the sound data being inputted from the sound input unit 103, whereon the start and end points of the sound segment are detected to be displayed on the detection result display panel 504. For more details, description will be made later on in conjunction with a program 900 (FIG. 9). Parenthetically, the program 900 represents a typical case in which the method of detecting the start and end points of the sound segment as illustrated in terms of the program 300 is applied to the sound segment detecting apparatus. In this conjunction, there may be mentioned an alternative method according to which the video of the video reproducing apparatus 110 is indexed to the start point of the sound interval after detection of the start and end points of the sound segment. Such head indexing can be realized by sending the frame number indicating the start point of the sound segment together with a search command to the video reproducing apparatus 110 from the sound processing unit 104.

In a step 609, the waveform 507 and the sound segment 508 are displayed on the panel 506. The sound data inputted up to the time when both the start and end points of the sound segment have been detected is displayed as the waveform 507, while the period extending from the IN frame to the OUT frame displayed on the detection result display panel 504 is displayed as the sound segment 508. More specifically, the sound data of the sound data storing ring buffer 1050 are shifted one round, starting from an offset 1054, to thereby generate the waveform display. Additionally, the data interval sandwiched between the start point data position 1052 and the end point data position 1053 is displayed as the sound interval 508. In this way, the user can visually observe the results of detection.
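Reading the ring buffer one round from the offset can be sketched as below. The names `unroll_ring` and `ring_to_display` are hypothetical, and the offset is assumed to point at the oldest data in the ring buffer, which the description does not state explicitly.

```python
def unroll_ring(buffer, offset):
    """Return the ring buffer contents in time order, oldest data first."""
    return buffer[offset:] + buffer[:offset]


def ring_to_display(position, offset, size):
    """Map a data position in the ring buffer to its index in the unrolled waveform."""
    return (position - offset) % size


ring = [5, 6, 7, 1, 2, 3, 4]  # the oldest sample (1) sits at offset 3
print(unroll_ring(ring, 3))   # [1, 2, 3, 4, 5, 6, 7]
```

The same mapping can be applied to the start and end point data positions so that the sound interval is drawn at the right place on the unrolled waveform.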

In a step 610, an end processing is executed. At first, a control command is sent to the video reproducing apparatus 110 for setting the video reproducing apparatus 110 to the stopped state (STAND BY OFF), and then the communication port is closed. Subsequently, the manipulation window 501 generated on the display device 101 is closed. Finally, the allocated memory area is released, whereupon the processing comes to an end.

Now, disclosed are a control scheme and a filtering processing scheme which can be adopted for applying the sound segment start/end point detecting method described hereinbefore in conjunction with the program 300 to the sound segment detecting apparatus.

According to the program 300, it is possible to detect the start and end points after having inputted the whole sound data associated with the video (clip). However, when sound data of long time duration is inputted en bloc, the processing of the long-time sound data obstructs the real-time detection of sound segments, because the time lag of the detection cannot be neglected. In order to ensure the real-time basis for the detection, it is preferable to divide the whole sound data into short pieces and to input and process them repeatedly.

At first, a control scheme for realizing the real-time detection will be disclosed. FIG. 7 is a view showing a control scheme or system of the sound interval detecting apparatus according to the present invention and illustrates a process which can lead to the detection of the start point of the sound segment. Rectangles shown in the figure represent processings for the subjects to be controlled, wherein the width of each rectangle represents the length of time taken for the relevant processing.

Reference numeral 702 designates the sound data input processing carried out in the sound input unit 103. The input sound is stored in the sound input unit 103 until a sound buffer of a predetermined time duration becomes full. At the time point when the sound buffer becomes full, an interrupt signal indicating that the sound buffer is full is sent to the sound processing unit 104. The time duration length or width of the rectangle 702 represents the capacity of the sound buffer. In response to reception of the interrupt signal mentioned above, the sound processing unit 104 transfers the data of the sound buffer to the memory 109. Reference numeral 703 designates a sound analysis processing carried out in the sound processing unit 104 by executing the program 300. The sound processing unit 104 starts the sound analysis processing 703 from the time point when the interrupt signal arrived, to thereby execute the sound analysis processing until a succeeding interrupt signal is received. Assuming, by way of example, that the time duration length of the sound buffer mentioned above is set to one second, then a time of one second at maximum can be spent for executing the sound analysis processing 703. Parenthetically, the time of one second is sufficient for executing the sound analysis processing. Further, assuming that Ts is set at 200 msec with Tn being at 500 msec, the start point and the end point of sound can be detected by processing two pieces of sound data at maximum. In that case, the time lag involved from the start of inputting to the sound input unit 103 to the detection of the sound by the sound processing unit 104 can be suppressed to about 3 seconds at maximum, which means that the detection can be realized substantially on a real-time basis. The above-mentioned Ts and Tn represent lower limits for the lengths of elapsed time in the sound state and silence state, respectively, as described hereinbefore by reference to FIG. 4, and these numeric values may previously be set with reference to the time duration of one syllable of speech and/or the time duration of a pause intervening between aural statements. Since the amount of data transferred to the memory 109 is 11 kilobytes when the sampling frequency is set at 11 kHz, the sampling bit number is set at 8 bits and the channel number is set to one (monophonic) for the buffer capacity corresponding to one second, there will arise no problem concerning the time taken for the data transfer.

A flow of processings up to the detection of the start point will be elucidated. When the DETECT button 503 is clicked, a video is first reproduced by the video reproducing apparatus 110 through an overall control processing, which is then followed by activation of the sound data input processing 702, preparation for the sound segment detection processing and the start of timer counting of the time spent for the processing (701). When the sound data is inputted through the sound data input processing 702, the data arrival time point T1 is recorded on the memory 109 through the sound analysis processing 703 (704). Further, when the start point of the sound is detected through the sound analysis processing, a detection flag on the memory 109 is set “ON” (705). Upon completion of the sound analysis processing 703, the detection flag is sensed through the overall control processing. When the detection flag is “OFF”, the interim result is displayed on the sound waveform monitor 505 (706). On the other hand, when the flag is “ON”, the current frame number is fetched from the video reproducing apparatus 110 with the frame number acquisition time point T2 being obtained from the timer, whereon the frame number and the acquisition time point mentioned above are stored in the memory 109. Further, by making use of the data arrival time point T1 and the frame number acquisition time point T2, the above-mentioned frame number is converted to the frame number at the time point at which the sound was started, whereon the frame number now obtained is stored in the memory 109 (707). In the case where the end point of the sound is to be detected in succession, the processings at 702 to 707 are executed repetitively until the end point is detected. Since execution of the processings 702 to 707 can be repeated any number of times, even a plurality of sound segments contained in one video entity can be detected, respectively.

Next, description will be directed to a method of deriving the frame number of the start point in the processing 707. It is assumed that the start point of the sound is contained at a position X in the sound data obtained through the sound data input processing 708. In that case, the time point T0 of the start point of the sound is estimated from the data arrival time point T1, the frame number acquisition time point T2 and the frame number TC2, whereon the frame number TC2 is converted to a frame number TC0 of the start point. This method can be represented by the following expressions:

T0 = T1 − dT·(L − X)/L [msec]  (Eq. 1)

TC0 = TC2 − (T2 − T0)/(1000/30) [frame]  (Eq. 2)

where L represents the size of the sound buffer (number of data pieces), and dT represents the time duration of the sound buffer. In the case where the sound data is of 8 bits and monophonic, the sound buffer size L is nothing but the byte number of the sound buffer. In the expression Eq. 2, the divisor 1000/30 is the duration of one frame in milliseconds, since the number of frames is 30 per second in the case of the NTSC picture signal. The end point of the sound can equally be determined through a similar procedure.
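A worked example of the two expressions may be helpful. The sketch below uses hypothetical function names and assumes that all time points are in milliseconds and that there are 30 frames per second, so that Eq. 2 is read as dividing the elapsed time (T2 − T0) by the frame period of 1000/30 msec; the rounding to a whole frame is also an assumption.

```python
def start_time_ms(t1_ms, dt_ms, buf_len, x):
    """Eq. 1: time point T0 of a start point found at data position x in the buffer."""
    return t1_ms - dt_ms * (buf_len - x) / buf_len


def start_frame(tc2, t2_ms, t0_ms, fps=30):
    """Eq. 2 as read here: step back from TC2 by the frames elapsed between T0 and T2."""
    elapsed_frames = (t2_ms - t0_ms) * fps / 1000
    return tc2 - round(elapsed_frames)


# One-second buffer of 11000 samples; the start point lies at X = 2750.
t0 = start_time_ms(t1_ms=5000, dt_ms=1000, buf_len=11000, x=2750)
print(t0)                                            # 4250.0
print(start_frame(tc2=300, t2_ms=5250, t0_ms=t0))    # 270
```

Here the start point lies 750 msec before the buffer arrived, and the frame number is read 1000 msec after the start point, so the start frame is 30 frames before TC2.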

With the control scheme described above, the start and end points of the sound segment can be detected substantially on a real-time basis.

Next, description will turn to a processing procedure for successively filtering the sound data inputted in divided pieces. FIG. 8 is a view for illustrating the positional relationship between the input data and the output data in the filtering processing step 303 or 304. Rectangles shown in the figure represent data arrays, respectively. More specifically, 801 designates an input data array (of data number L [pieces]), and 802 designates a filter buffer (data number Lf [pieces]). The filter buffer 802 corresponds to a filter of filter size 321 in the step 303 and to a filter of filter size 322 in the step 304.

Through the filtering processings in the steps 303 and 304, data of the input data array 801 are sequentially read out to be inputted to the filter buffer 802, whereon the maximum value or the minimum value is determined from all the data of the filter buffer 802 to be outputted as the data at a mid position of the filter size. In this case, a fragmental output data 803 is obtained from the whole input data of the input data array 801. Since Lf pieces, corresponding to the filter size, of the L pieces of input data are used for initialization of the filter buffer 802, no output data can be obtained from a leading section 804 and a trailing section 805 of the output data array. If the filter buffer 802 is initialized every time the data is received from the sound input unit 103 in the control scheme described hereinbefore by reference to FIG. 7, the envelope will be broken into fragments as a result of the filtering.

The filter buffer 802 is therefore initialized only once in the start processing step 701. Thereafter, the filter buffer 802 is held without being cleared en route so that the position for the input data to be fetched in succession and the contents of data can be held continuously. Thus, for the (n+1)-th sound analysis processing, Lf pieces of data of the filter buffer 802 carried over from the n-th sound analysis processing and L pieces of input data 806 in the (n+1)-th sound analysis processing can be made use of, whereby L pieces of output data, i.e., a sum of data in the data sections 805 and 807, can be obtained. In other words, L pieces of output data can be obtained for L pieces of input data, so that the filtering processing can be performed continuously for the sound data inputted dividedly.
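The effect of holding the filter buffer across successive inputs can be illustrated with a small maximum-value filter. The class name `StreamingMaxFilter` is hypothetical, and the sketch simply takes the maximum over the buffer contents without modelling the mid-position output alignment described for FIG. 8.

```python
from collections import deque


class StreamingMaxFilter:
    """Maximum-value filter whose buffer persists between calls to process()."""

    def __init__(self, size):
        self.buf = deque(maxlen=size)  # filter buffer, initialized only once

    def process(self, chunk):
        out = []
        for x in chunk:
            self.buf.append(x)  # the oldest datum drops out automatically
            if len(self.buf) == self.buf.maxlen:
                out.append(max(self.buf))
        return out


data = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]

whole = StreamingMaxFilter(3).process(data)

f = StreamingMaxFilter(3)
pieces = f.process(data[:4]) + f.process(data[4:8]) + f.process(data[8:])

print(whole)            # [4, 4, 5, 9, 9, 9, 6, 6, 5, 8]
print(pieces == whole)  # True
```

After the first piece, each piece of four inputs yields four outputs, so the divided input gives the same envelope as the undivided one.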

In this conjunction, it should however be noted that the output data corresponding to the trailing section 805 in the n-th sound analysis processing can be obtained only after the input data 806 has been inputted in the (n+1)-th sound analysis processing. According to the control scheme illustrated in FIG. 7, the data positions X of the start and end points and the input data arrival time point T1 read out from the timer are used for computing the frame numbers at the start and end points of the sound, as expressed in the expression Eq. 1. For this reason, the two data arrival time points in the n-th and (n+1)-th sound analysis processings, respectively, are recorded in the memory 109. When the start and end points of the sound are found in the trailing section 805, the arrival time point in the n-th sound analysis processing is used, whereas when the start and end points of the sound are found in the data section 807, the arrival time point in the (n+1)-th sound analysis processing is used.

Parenthetically, the filter size Lf may be set at a value which allows the difference resulting from subtraction (L−Lf) to be greater than zero. The basic frequency of the human voice is generally 100 Hz or higher. Accordingly, by setting the filter size to the number of data pieces contained in a time period not shorter than 10 msec, which is the inverse of the basic frequency (e.g. one frame period of 33 msec), there will arise no problem in determining arithmetically the envelope. Incidentally, the number of data pieces mentioned above can be determined by multiplying the time duration by the sampling frequency.
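The conversion from a time duration to a number of data pieces can be checked with a short calculation. The function name `duration_to_samples` is hypothetical; the sampling frequency of 11 kHz and the one-second buffer are the example values used earlier in this description.

```python
def duration_to_samples(duration_ms, sampling_hz):
    """Number of data pieces contained in a time period of duration_ms."""
    return round(duration_ms * sampling_hz / 1000)


sampling_hz = 11000
buffer_len = duration_to_samples(1000, sampling_hz)  # L for a one-second buffer
filter_len = duration_to_samples(33, sampling_hz)    # Lf for one frame period

print(buffer_len, filter_len)       # 11000 363
print(buffer_len - filter_len > 0)  # True
```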

Through the procedure described above, the detection processing can be executed without bringing about discontinuity.

FIG. 9 shows a flow chart for illustrating a processing procedure for detecting the start and end points of the sound interval in which the control scheme and the filtering scheme described above are reflected, and FIG. 10 shows data structures of the sound data and control data stored in the memory 109.

The flow chart shown in FIG. 9 illustrates a flow of sound interval detection processing in which the time duration rules for the CM video (clips) are adopted. A program 900 is a processing program for detecting a pair of the start and end points of the sound segment. This program 900 is executed in the step 608. Globally, the program 900 is comprised of four processings. They are (1) processing for detecting the start point of the sound segment, (2) processing for detecting the end point of the sound segment, (3) decision processing relying on the time duration rules for the CM and (4) detection time limiting processing for terminating the detection process when a prescribed time duration lapses. The processing (1) is executed in steps 902 to 904, and the processing (2) is executed in steps 906, 907 and 910. Through these processing steps, control for the processings 703 to 707 shown in FIG. 7 is realized. The processing (3) includes a step 905 and steps 911 to 915. Through these processing steps, only the sound segment of a predetermined constant time duration can be sieved out. The processing (4) includes steps 908 and 909. Using these processing steps, an error processing is executed when no end point is found within an upper limit imposed on the time duration for executing the detection processing. It should however be mentioned that the processings required at least for detecting the sound interval are the processings (1) and (2). The processings (3) and (4) are optional.

In the following, individual steps will be described in a sequential order.

A step 901 is provided for the initialization processing. The sound data and the control data to be stored in the memory 109 are initialized, whereon the control processing 701 described previously by reference to FIG. 7 is executed. More specifically, a sound buffer 1030, the sound data storing ring buffer 1050 and control parameters 1010 are initialized, and a vacancy flag 1042 for a filter buffer 1040 is set “TRUE”.

In a step 902, decision is made as to the status of start point detection for a sound segment. A step 903 is executed until a start point flag “IN” 1017 becomes “TRUE”.

In the step 903, the start point of the sound interval is detected. The program 300 is executed, and the interim result is displayed on the sound waveform monitor 505. When the start point is detected, the flag “IN” 1017 is set “TRUE”, the current frame number is read out from the video reproducing apparatus 110, and additionally the frame number acquisition time point T2 is read out from the timer.

In a step 904, the frame number of the start point as detected is arithmetically determined. The time point T0 of the start point is calculated in accordance with the expression Eq. 1, while the frame number TC0 of the start point is determined in accordance with the expression Eq. 2. The frame number TC0 of the start point is displayed in the detection result display panel 504 while the flag “IN” is reset to “FALSE”.

In a step 905, decision is made as to the status of detection of the sound interval. Until the sound segment of a predetermined constant time duration is detected, processing steps described below are executed.

In a step 906, decision is made as to the status of end point detection for the sound segment. Steps 907 to 909 are executed until an end point flag “OUT” 1018 becomes “TRUE”.

In the step 907, the end point of the sound segment is detected. The program 300 is executed, and the interim result is displayed on the sound waveform monitor 505. When the end point is detected, the flag “OUT” 1018 is set “TRUE”, and the current frame number is read out from the video reproducing apparatus 110 while the frame number acquisition time point T2 is read out from the timer. In that case, the frame number of the end point is arithmetically determined in a step 910.

In the step 908, the time elapsed in the detection processing is decided. When the time elapsed since the detection of the start point becomes longer than the prescribed detection limit time, it is then decided that the picture of the proper time duration is not contained in the picture being processed, whereupon the step 909 is executed. The prescribed detection limit time may be set at 60 seconds, which is twice as long as the CM time duration of 30 seconds. In case the current input data arrival time point T1 1022 satisfies the condition that T1>T2+60 [sec], where T2 represents the frame number acquisition time point in the step 903, decision is then made that the picture of concern is not the one of the proper time duration.

In the step 909, the detection result is discarded, whereupon the detection processing is interrupted. The start point detected in precedence is canceled. Further, data inputting from the sound input unit 103 is stopped, and the video reproduction in the video reproducing apparatus 110 is caused to pause with the sound buffer 1030 and the filter buffer 1040 being cleared.

In the step 910, the frame number of the detected end point is arithmetically determined. The time point T0 of the end point is calculated in accordance with the expression Eq. 1, while the frame number TC0 of the end point is determined in accordance with the expression Eq. 2. The frame number TC0 of the end point is displayed on the detection result display panel 504, and the flag “OUT” is reset to “FALSE”.

In the step 911, the time duration T of the sound segment is calculated. To this end, the difference between the time point of the start point determined in the step 904 and the time point of the end point determined in the step 910 is taken as T.

In a step 912, decision processing relying on the time duration rules for the CM is executed. When the time duration of the detected sound segment meets the prescribed constant time duration, steps 913 and 914 are executed. By contrast, when the prescribed constant time duration is exceeded, a step 915 is executed. When the detected time duration falls short of the prescribed constant time duration, detection of the end point of a succeeding sound segment is resumed. Through this procedure, only the video having the sound segment of the prescribed constant time duration can be detected. In the case now under discussion, since the general rule “CM is so composed as to have the time duration of 15 seconds or 30 seconds per one” is adopted, the prescribed constant time duration is set to 15 seconds or 30 seconds, with a tolerance of one second for the prescribed constant time duration of 15 seconds and a tolerance of 2 seconds for the prescribed constant time duration of 30 seconds. However, these values may be altered appropriately depending on practical applications.
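The decision of the step 912 can be illustrated by the following minimal sketch. The names Decision, RULES and classify_duration are hypothetical and do not appear in the embodiment. The sketch further assumes that a duration is regarded as "exceeded" only when it is longer than the longest prescribed duration plus its tolerance, so that a segment lying between the 15-second and the 30-second windows continues to be extended; the description does not state how that case is handled.

```python
from enum import Enum


class Decision(Enum):
    ACCEPT = "accept"      # corresponds to steps 913 and 914
    DISCARD = "discard"    # corresponds to step 915
    CONTINUE = "continue"  # resume detection of a succeeding end point


# (prescribed duration, tolerance) pairs in seconds, as quoted in the text.
RULES = ((15.0, 1.0), (30.0, 2.0))


def classify_duration(t, rules=RULES):
    """Classify a detected sound segment duration t (seconds)."""
    # Accept when t lies within the tolerance of any prescribed duration.
    for nominal, tolerance in rules:
        if abs(t - nominal) <= tolerance:
            return Decision.ACCEPT
    # Assumption: discard only beyond the longest permitted duration.
    longest = max(nominal + tolerance for nominal, tolerance in rules)
    if t > longest:
        return Decision.DISCARD
    # Otherwise the segment is still too short; keep the start point.
    return Decision.CONTINUE
```

For example, classify_duration(15.4) yields Decision.ACCEPT, classify_duration(20.0) yields Decision.CONTINUE, and classify_duration(33.0) yields Decision.DISCARD.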

In the steps 913 and 914, the detected start and end points are adopted as the start and end points of the sound interval. The data input from the sound input unit 103 is interrupted, and the picture reproduction by the picture reproducing apparatus 110 is caused to pause while the sound buffer 1030 and the filter buffer 1040 are cleared.

In the step 915, the result of detection is discarded and the detection processing is interrupted. The detected start and end points are canceled, and the display on the panel 504 is cleared. Further, the data inputting from the sound input unit 103 is stopped, with the picture reproduction by the picture reproducing apparatus 110 being caused to pause. The sound buffer 1030 and the filter buffer 1040 are cleared.

Through the procedure described above, only the sound segment of the prescribed constant time duration can be detected.

Finally, description will be directed to data structures of the sound data and the control data stored in the memory 109. FIG. 10 is a view showing examples of the data structure for realizing the sound segment detection according to the present invention. Data for the processing are stored in the memory 109 to be read out to the CPU 107 as occasion requires.

Reference numeral 1000 designates sound signal information, which contains a sampling frequency 1001, a sampling bit number 1002 and a channel number 1003 (“1” for the monophonic, “2” for the stereophonic) which are used when the sound signal is digitized in the sound input unit 103.

Reference numeral 1010 designates control parameters, in which the various parameters and flags employed in the sound interval detection processing are stored. Reference numerals 1011 to 1014 designate variable parameters which can be changed on the parameter setting panel 513. Reference numerals 1015 to 1018 designate four flags indicating the states at the time points when the start and end points of the sound interval are decided, as described hereinbefore by reference to FIG. 4, and reference numerals 1019 and 1020 designate counters for counting the sound state and the silence state, respectively. The start point flag 1017 and the end point flag 1018 are set to “FALSE” if the start and end points have not yet been detected, and are set to “TRUE” when the start and end points have been detected. Reference numeral 1021 designates the data position X of the start and end points in the input sound data described hereinbefore by reference to FIG. 7. Reference numerals 1022 and 1023 designate the data arrival time point T1 described hereinbefore by reference to FIG. 8 and the data arrival time point in the preceding sound segment detection processing, respectively. By reading out the frame numbers at the time points when the flags 1017 and 1018 are detected to be “TRUE”, the frame numbers of the start and end points can be arithmetically determined in accordance with the expressions Eq. 1 and Eq. 2, respectively. The frame numbers of the start and end points are stored in the memory 109 as well. As an alternative, the arithmetically determined frame numbers may be written to the auxiliary storage unit 106 in sequential order. Sound intervals can then be detected for as long as the capacity of the auxiliary storage unit 106 permits.

The sound buffer 1030 shows a data structure of a buffer which stores the processing data 311 to 315 transferred among the individual steps of the program 300. On the memory 109, there are prepared three buffers for the input, the work and the output, respectively. The buffer size 1031 is set to the same value for all of these buffers. Data number 1032 represents the number of data pieces stored in a relevant buffer 1030. As described hereinbefore by reference to FIG. 8, since the output data for the leading section 804 and the trailing section 805 cannot be obtained with only the first input buffer data, the data number of the output buffer decreases. Accordingly, the data number 1032 is prepared in addition to the buffer size 1031. Reference numeral 1033 designates processing data, i.e., the data to be processed.

The filter buffer 1040 is realized in a data structure for a ring buffer employed for the maximum/minimum type filtering in the steps 303 and 304. In this conjunction, there are prepared on the memory 109 two data sets, one for the MAX filtering and one for the MIN filtering. The buffer size 1041 is arithmetically determined from the filter time duration TLf 1012. The vacancy flag 1042 indicates the initialized state of the filter buffer. The vacancy flag is set to “TRUE” in the initialized state, where the filter buffer is vacant. On the other hand, once the filter buffer is filled with data, the vacancy flag is set to “FALSE”. When the vacancy flag 1042 is “TRUE” at the time when processing is performed on the input sound buffer 1030, initialization is achieved by copying the input data in an amount equivalent to the size 1041. By contrast, when the vacancy flag is “FALSE”, no initialization is performed. In this way, the envelope can be arithmetically determined without discontinuity between successive input buffers. Reference numeral 1043 designates an offset indicating the position at which the succeeding input data is to be fetched. Reference numeral 1044 designates the fetched input data, which represents the data to be subjected to the filtering processing.
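The role of the filter buffer can be illustrated by the following minimal sketch, which is not the implementation of the embodiment. The names RingFilter and envelope are hypothetical. The sketch assumes that the MAX filtering is applied to the absolute sample values and is followed by the MIN filtering over a window of the same length, and that the buffer size is the number of samples in the filter time duration. A deque of fixed maximum length stands in for the ring buffer, and the state kept between calls corresponds to the vacancy flag being “FALSE”.

```python
from collections import deque


class RingFilter:
    """Sliding-window filter whose window survives across input blocks."""

    def __init__(self, size, func):
        self.size = size               # window length in samples (cf. 1041)
        self.func = func               # max for MAX filter, min for MIN filter
        self.buf = deque(maxlen=size)  # ring buffer of the latest samples

    @property
    def vacant(self):
        # True until the window has been filled once (cf. flag 1042).
        return len(self.buf) < self.size

    def process(self, block):
        out = []
        for x in block:
            self.buf.append(x)         # oldest sample is dropped when full
            if not self.vacant:
                out.append(self.func(self.buf))
        return out


def envelope(samples, size):
    """Assumed ordering: MAX filter on |x|, then MIN filter."""
    max_filter = RingFilter(size, max)
    min_filter = RingFilter(size, min)
    return min_filter.process(max_filter.process([abs(s) for s in samples]))
```

Because the window is retained between calls to process, feeding the data in several short blocks gives the same output as feeding it in one block, which is the continuity property described above. The first block yields fewer output values than input values, which corresponds to the reduced data number of the output buffer.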

Reference numeral 1050 designates a sound data storing ring buffer into which the sound data inputted from the sound input unit 103 is copied so as to constantly hold the sound data of the past several seconds. The data stored in the sound data storing ring buffer 1050 is used for displaying the sound data waveform 507 and for reproducing the sound with the PLAY button 509. Reference numeral 1051 designates the buffer size. By selecting the buffer size 1051 to be an integral multiple of the buffer size 1031, copying can be easily carried out. Reference numeral 1052 designates a data position on the ring buffer which corresponds to the data position X of the start point of the sound interval described hereinbefore by reference to FIG. 7. Similarly, reference numeral 1053 designates a data position on the ring buffer which corresponds to the end point. Initially, values smaller than zero are set at the data positions 1052 and 1053, to be subsequently replaced by the actual data positions upon detection of the start and end points. Reference numeral 1054 designates an offset indicating the leading position of the location at which the succeeding input data is to be copied. Reference numeral 1055 designates the sound data.
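The reason why an integral multiple simplifies the copying can be seen from the following minimal sketch. The class SoundRingBuffer and its member names are hypothetical, and 8-bit samples held in a bytearray are assumed. Since every input block has the same length and the ring size is a multiple of that length, a block never straddles the end of the ring, so that one slice assignment suffices.

```python
class SoundRingBuffer:
    """Ring buffer holding the most recent input blocks of sound data."""

    def __init__(self, block_size, n_blocks):
        self.block_size = block_size          # cf. buffer size 1031
        self.size = block_size * n_blocks     # cf. buffer size 1051
        self.data = bytearray(self.size)      # cf. sound data 1055
        self.offset = 0                       # cf. offset 1054
        self.start_pos = -1                   # cf. 1052, negative = not set
        self.end_pos = -1                     # cf. 1053, negative = not set

    def store(self, block):
        """Copy one input block and return the position it was written to."""
        if len(block) != self.block_size:
            raise ValueError("block must have exactly block_size samples")
        written_at = self.offset
        # No wrap-around handling is needed inside a single copy.
        self.data[written_at:written_at + self.block_size] = block
        self.offset = (written_at + self.block_size) % self.size
        return written_at
```

With block_size corresponding to 1 second of sound and n_blocks set to 40, the buffer would hold the past 40 seconds, overwriting the oldest block on every call to store once it is full.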

Now, the memory size for the data used in the sound segment detection processing will be estimated. Assuming, by way of example, that the sound signal information 1000 is monophonic sound data of 11 kHz and 8 bits and that the time duration of the sound data which can be recorded in the input buffer is 1 second, the memory size demanded for the sound buffer 1030 is on the order of 11 kilobytes, and the total sum of the capacities of the three buffers is on the order of 33 kilobytes. Assuming that the time duration for storing the sound is 40 seconds, the capacity required for the sound data storing ring buffer 1050 is on the order of 440 kilobytes. Assuming that the filter time duration is 30 msec., the capacity required for the filter buffer 1040 is on the order of 0.3 kilobytes. Thus, even the sum of the capacities of the two filter buffers is short of 1 kilobyte. For these reasons, the method according to the present invention can be carried out satisfactorily even by using an inexpensive computer whose memory size is relatively small.
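The above estimate can be reproduced by the following short calculation. The function name buffer_bytes is hypothetical, and the sampling frequency of 11 kHz is taken as exactly 11,000 samples per second for the purpose of the arithmetic.

```python
def buffer_bytes(sample_rate_hz, bits, channels, seconds):
    """Bytes needed to hold `seconds` of uncompressed sound data."""
    return round(sample_rate_hz * (bits // 8) * channels * seconds)


RATE, BITS, MONO = 11_000, 8, 1  # values assumed from the example in the text

sound_buffer = buffer_bytes(RATE, BITS, MONO, 1)       # one buffer 1030
three_buffers = 3 * sound_buffer                       # input, work, output
ring_buffer = buffer_bytes(RATE, BITS, MONO, 40)       # ring buffer 1050
filter_buffer = buffer_bytes(RATE, BITS, MONO, 0.030)  # one buffer 1040
two_filters = 2 * filter_buffer                        # MAX and MIN filters
```

The results are 11,000 bytes, 33,000 bytes, 440,000 bytes, 330 bytes and 660 bytes, respectively, in agreement with the orders of magnitude stated above.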

With the arrangement taught by the present invention, the presence or absence of the sound, which has heretofore been judged auditorily, can be detected quantitatively and automatically, providing the effect that the manpower involved in the sound segment detecting work can be reduced. It is sufficient for the operator to place a CM material in the picture reproducing apparatus and manipulate the buttons on the screen of the sound processing apparatus. Besides, such complicated manipulations as video reproduction, pause or stopping and reverse reproduction, as well as frequent repetition thereof, are rendered unnecessary, with the advantageous effect that the manipulation can be simplified. Furthermore, owing to such arrangement that the sound signal is inputted after being divided into shorter time intervals, the sound segment can be detected on a real-time basis, which is effective for enhancing the work efficiency. With regard to the confirmation work, because the sound in the detected sound segment is displayed in the form of a waveform and played, the result of detection can be instantaneously confirmed visually and auditorily, which is advantageous from the viewpoint of reducing the manpower involved in the confirmation work. Besides, owing to such arrangement that the sound segment can be detected by making use of the time duration rules for the CM video, improper material which is too long or too short can be canceled or discarded, so that there arises no necessity of additionally inspecting the time duration of the CM video. Furthermore, by virtue of such arrangement that margins can be affixed to the detected sound segment, CM videos (clips) of high quality which suffer essentially no dispersion in the time duration can be registered in the managing apparatus, which is advantageous from the standpoint of enhancing the quality of the registered videos.

Further, the filtering processing of the present invention which is employed for the arithmetic determination of the envelope can be carried out with a computer of a small scale such as a personal computer, because it involves less computational overhead than computation of power spectra. Thus, the present invention provides such effect that the computation can be performed even when the sampling rate for the sound signal input is high.

The apparatus for carrying out the method of detecting the sound segment in the video can be realized by a small-scale computer such as a personal computer, whereby the detecting apparatus can be realized inexpensively.

INDUSTRIAL UTILIZABILITY

As is apparent from the foregoing description, the method and the apparatus for detecting the sound segments according to the teachings of the present invention are suited for application to a CM registering apparatus for registering CM clips constituted by video and audio by detecting the start point and the end point thereof.

Furthermore, the method and apparatus for detecting the sound segments according to the present invention can be made use of as a CM detecting apparatus for detecting an interval of a CM video inserted in a movie and a TV program.

What is claimed is:
 1. A method of detecting start and end points of a sound segment in a video, comprising: receiving a sound signal recorded in a video program; determining an envelope of a waveform of the sound signal; and detecting one of a start point and an end point of an individual sound segment from the sound signal, at a time point at which said envelope intersects a preset threshold value for a sound level of the sound segment.
 2. A method as claimed in claim 1, wherein a lower limit for the length of an elapsed time of a silence state is set, such that the time point at which said envelope intersects the threshold value for the sound level is detected as the start point or the end point of the sound segment when the elapsed time during which the value of the waveform envelope of the sound signal has remained smaller than the threshold value of said sound level is longer than said lower limit.
 3. A method as claimed in claim 1, wherein a lower limit for the length of an elapsed time of a sound state is set previously, such that the time point at which said envelope intersects the threshold value for the sound level is detected as the start point or the end point of the sound segment when the elapsed time during which the value of the waveform envelope of the sound signal has exceeded the threshold value of said sound level is longer than said lower limit.
 4. A method as claimed in claim 1, wherein the envelope of the waveform of the sound signal is arithmetically determined by filtering of the sound signal for a predetermined duration on a time-serial basis.
 5. A method as claimed in claim 4, wherein the sound signal is filtered, via a maximum value filter for determining sequentially maximum values of the sound signal for a predetermined duration, and via a minimum value filter for determining sequentially minimum values of the sound signal for said predetermined duration.
 6. A method as claimed in claim 1, wherein the threshold value of the sound level is set using the sound signal indicating a silence for several seconds without reproducing the video, and a maximum value of the sound level of noise.
 7. An apparatus for detecting start and end points of a sound segment in a video, comprising: a video reproducing device to reproduce a video from a storage medium and to stop a video at a desired position designated by a user; a sound input unit to produce a sound signal recorded on an audio track of the video reproduced from the video reproducing device; and a sound processing unit to process the sound signal, including to determine start and end points of a sound segment from the sound signal, said sound processing unit comprising: envelope arithmetic means for determining arithmetically an envelope of a waveform of the sound signal; threshold value setting means for setting a threshold value of a sound level for values of said envelope; start/end point detecting means for detecting a time point at which said threshold value of the sound level and said envelope intersects each other as a start point or an end point of the sound segment; frame position determining means for determining a frame position of the video at a time point at which the start point or the end point of the sound segment is detected; and display means for displaying the frame position of the start point or the end point of the sound segment.
 8. An apparatus as claimed in claim 7, wherein said frame position determining means comprises: timer means for counting the elapsed time, starting from the start of the detection processing, means for reading out the frame position of the video, elapsed time storage means for storing elapsed time at a time point at which the start point or the end point of the sound signal is detected and the elapsed time at a time point at which said frame position is read out, and frame position correcting means for correcting the frame position as read out by using difference between both the elapsed times.
 9. An apparatus as claimed in claim 7, wherein said sound processing unit further comprises means for stopping reproduction of the video at the frame positions corresponding to the start and end points of the sound segment.
 10. An apparatus for detecting start and end points of a sound segment in a video, comprising: a video reproducing device to reproduce a video and to stop a video at a desired position designated by a user; a sound input unit to produce a sound signal recorded on an audio track of the video; and a sound processing unit to process the sound signal, including to determine start and end points of a sound segment from the sound signal, said sound processing unit comprising: envelope arithmetic means for determining arithmetically an envelope of a waveform of the sound signal, threshold value setting means for setting previously a level of threshold for values of said envelope, start point detecting means for detecting as a start point of a sound segment a time point at which said envelope exceeds the level of said threshold, end point detecting means for detecting as an end point of the sound segment a time point at which said envelope falls below the level of said threshold, frame position determining means for determining frame positions of the video at time points at which said start point and said end point are detected, respectively, frame position storage means for storing individually the frame positions of said start point and said end point of the sound segment, and display means for displaying individually said frame positions of said start point and said end point, to thereby display the frame positions of said start point and said end point of the sound segment.
 11. An apparatus as claimed in claim 10, wherein said sound processing unit includes buffer memory means for storing the sound signal inputted on a time-serial basis, and that when the start point and the end point of the sound segment are detected, a waveform in the sound segment is displayed on said display means.
 12. An apparatus as claimed in claim 10, wherein said sound processing unit includes reproducing means for reproducing the sound signal in the sound segment at the time points when the sound signal as well as the start point and the end point of the sound segment are detected.
 13. An apparatus as claimed in claim 10, wherein said sound processing unit includes time duration length setting means for setting an upper limit of a predetermined duration of the sound segment and a tolerance range, and time duration comparison means for comparing a detected duration extending from the start point to the end point of the sound segment as detected with a set duration, and that when said detected duration is shorter when compared with said set duration, the succeeding end point of the sound segment is detected while holding the start point of the sound segment, whereas when said detected duration is longer when compared with said set duration, detection is terminated with result of the detection being discarded, while when said detected duration falls within the tolerance range of sound data, the detection is intercepted with the result of the detection being held and the detection is terminated unless the end point of the sound segment is detected even when said detected duration exceeds a time duration twice as long as said set duration.
 14. An apparatus as claimed in claim 13, wherein the upper limit of the predetermined duration of the sound segment is set to be 15 seconds or 30 seconds, the tolerance range is of one or two seconds, and that the video subjected to the detection processing is a commercial video clip.
 15. An apparatus as claimed in claim 13, wherein said sound processing unit includes margin setting means for setting margins at a front side in precedence to the start point of the sound segment and at a rear side in succession to the end point of the sound segment, respectively, and that when said detected duration of the sound segment falls within said tolerance range of said set duration, results of shifting the detected start point and the detected end point frontwards and rearwards, respectively, are determined as the start point and the end point, respectively, of the sound segment.
 16. A method of detecting start and end points of a video associated with a sound segment, comprising: receiving a video signal having a sound signal; determining an envelope of a waveform of the sound signal; and detecting a start point of a sound segment on the basis of continuity of a silence segment in the waveform of the sound signal, and an end point of the sound segment on the basis of a falling point of the sound segment.
 17. A method as claimed in claim 16, wherein frames constituting the video are derived from the video signal to be displayed at a predetermined time interval on a time-serial basis, the waveform representing the sound signal and a display bar representing said video frame interval are displayed along with said frame display on the time-serial basis, and that frame numbers of the start point or the end point of said video frame interval are set again by modifying said video frame interval bar along a time axis on display.
 18. A method as claimed in claim 17, wherein the start point or the end point of the sound segment is determined at a time point at which a preset threshold value of a sound level of the sound segment and said envelope intersect each other.
 19. A method of detecting audio segments in a video clip, comprising: receiving audio data associated with a video clip; obtaining a waveform of the audio data; determining an envelope of the waveform of the audio data using maximum and minimum value filters; making a comparison between the audio data within the envelope and a threshold value preset for an audio level; and detecting a start point and an end point of each audio segment in the video clip based on said comparison.
 20. A method as claimed in claim 19, wherein the start point or the end point of an audio segment in the video clip is detected at a time point at which the audio data within the envelope intersects the threshold value preset for the audio level.
 21. A method as claimed in claim 20, wherein the start point of the audio segment is detected at the time point when an audio state has lasted longer than a first time duration designated for the audio state, after a silence state lasted longer than a second time duration designated for the silence state; and wherein the end point of the audio segment is detected at the time point when the silence state has lasted longer the second time duration, after the audio state lasted longer than the first time duration.
 22. A method as claimed in claim 19, wherein the audio data is filtered, via the maximum value filter, to determine sequentially maximum values of the audio data for a predetermined duration, and via the minimum value filter, to determine sequentially minimum values of the audio data for the predetermined duration on a time-serial basis.
 23. An apparatus for detecting audio segments in a video clip, comprising: a video playback arranged to reproduce a video clip from a storage medium; a sound input unit arranged to separate audio data associated with the video clip reproduced from the video playback; a display unit; and a sound processor unit coupled to receive the audio data associated with the video clip, and configured to perform the following: obtain a waveform of the audio data; determine an envelope of the waveform of the audio data; make a comparison between the audio data within the envelope and a threshold value preset for an audio level; and detect a start point and an end point of each audio segment in the video clip based on said comparison; and provide a visual display of the start point and the end point of each audio segment in the video clip on said display unit.
 24. An apparatus as claimed in claim 23, wherein the sound processor unit is configured to detect the start point or the end point of an audio segment in the video clip at a time point at which the audio data within the envelope intersects the threshold value preset for the audio level.
 25. An apparatus as claimed in claim 24, wherein the sound processor unit is configured to detect the start point of the audio segment at the time point when an audio state has lasted longer than a first time duration designated for the audio state, after a silence state lasted longer than a second time duration designated for the silence state; and to detect the end point of the audio segment at the time point when the silence state has lasted longer the second time duration, after the audio state lasted longer than the first time duration.
 26. An apparatus as claimed in claim 23, wherein the sound processor unit comprises maximum and minimum value filters such that audio data is filtered, via the maximum value filter, to determine sequentially maximum values of the audio data for a predetermined duration, and via the minimum value filter, to determine sequentially minimum values of the audio data for the predetermined duration on a time-serial basis.